title: "Data Scientists' Salaries"

author: 'Gamaliel Mendoza'

date: 25/May/25

This notebook analyzes data scientists' salaries as of 2024.

  • The analysis is based on two datasets:
    • A Kaggle salary-prediction dataset (adilshamim8/salary-prediction-salary-dataset)
    • A Hugging Face dataset of data-science job salaries, 2020–2022 (hugginglearners/data-science-job-salaries)
  • The analysis procedure was:
    • Queries with SQLite3
    • Conversion to DataFrames
    • Statistical analysis:
      • One-way analysis of variance (for the 2020–2022 dataset)
      • Linear regression (for the 2024 dataset)
# Flags: set INSTALL to True on the first run; PYODIDE selects the JupyterLite installer
INSTALL = False
PYODIDE = False
if INSTALL:
    print('Downloading')
    if not PYODIDE:
        !pip install kagglehub huggingface_hub ipython-sql scipy seaborn
        print('Downloaded with pip')
    else:
        print('Downloading with piplite')
        import piplite
        for pkg in ("kagglehub", "seaborn", "huggingface_hub", "ipython-sql", "scipy"):
            await piplite.install(pkg)
        print('Done')
else:
    print('Libraries already downloaded')
Libraries already downloaded
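As a possible refinement, the manual install flags above could be derived from the runtime instead of being set by hand. A minimal sketch, assuming the notebook may run under Pyodide/JupyterLite (where the interpreter reports the `emscripten` platform):

```python
# Detect whether we are running under Pyodide/JupyterLite so the right
# installer (pip vs. piplite) can be chosen automatically.
import sys

IN_PYODIDE = sys.platform == "emscripten" or "pyodide" in sys.modules

if IN_PYODIDE:
    print("JupyterLite detected: install with piplite")
else:
    print("Standard Jupyter detected: install with pip")
```

This removes one source of error: the flag can no longer disagree with the environment the notebook actually runs in.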
#await piplite.install("pretty_jupyter")
import kagglehub
import warnings
import pandas as pd
import prettytable
prettytable.DEFAULT = 'DEFAULT'  # workaround so ipython-sql renders results with prettytable's default style
import sqlite3
warnings.simplefilter(action='ignore', category=FutureWarning)
import os
from matplotlib import pyplot as plt
# The latest version of each dataset is downloaded through the Kaggle and Hugging Face APIs
path = kagglehub.dataset_download("adilshamim8/salary-prediction-salary-dataset")
fls=os.listdir(path)
df=pd.read_csv(path+'/'+fls[0])
dfTemp = pd.read_csv("hf://datasets/hugginglearners/data-science-job-salaries/ds_salaries.csv")
# Tables and connections are created for the SQL tables
%reload_ext sql
con=sqlite3.connect('DS_sal.db')
cursor=con.cursor()
%sql sqlite:///DS_sal.db
df.to_sql("salary_prediction_updated_datas",con,if_exists='replace', index=False)
dfTemp.to_sql("ds_salaries",con,if_exists='replace', index=True)
607

Verify that the data loaded correctly and is manageable

# Query the Hugging Face table to check it loaded
%sql Select distinct * from ds_salaries limit 5;
 * sqlite:///DS_sal.db
Done.
index  Unnamed: 0  work_year  experience_level  employment_type  job_title                   salary  salary_currency  salary_in_usd  employee_residence  remote_ratio  company_location  company_size
0      0           2020       MI                FT               Data Scientist              70000   EUR              79833          DE                  0             DE                L
1      1           2020       SE                FT               Machine Learning Scientist  260000  USD              260000         JP                  0             JP                S
2      2           2020       SE                FT               Big Data Engineer           85000   GBP              109024         GB                  50            GB                M
3      3           2020       MI                FT               Product Data Analyst        20000   USD              20000          HN                  0             HN                S
4      4           2020       SE                FT               Machine Learning Engineer   150000  USD              150000         US                  50            US                L
# Query the Kaggle table to check it loaded
%sql Select distinct * from salary_prediction_updated_datas limit 5;
 * sqlite:///DS_sal.db
Done.
YearsExperience  Education Level  Job Role              Salary
14.98            PhD              Marketing Specialist  77861.24
38.03            PhD              Software Engineer     166880.38
29.28            Bachelor's       Business Analyst      123573.27
23.95            Master's         Marketing Specialist  90155.08
6.24             Master's         Business Analyst      55931.41

General overview (2020–2022)

import seaborn as sns
yrs=pd.read_sql_query("select work_year from ds_salaries where job_title like 'Data Scientist'",con)
AllDS=pd.read_sql_query("select * from ds_salaries where job_title like 'Data Scientist'",con)
AllCs=AllDS.columns
Years=pd.unique(yrs['work_year'])
bxs=pd.DataFrame()
for y,yr in enumerate(Years):
    bxyr=pd.read_sql_query(f"select salary_in_usd from ds_salaries where job_title like 'Data Scientist' and work_year = {yr} ",con)
    bxs.loc[0:len(bxyr['salary_in_usd']),f'{yr}']=bxyr.loc[:,'salary_in_usd']/12
boxes=pd.DataFrame(bxs)

# Plot the monthly salary distribution per year with horizontal boxes
#sns.boxplot(data=boxes)

# Add in points to show each observation

import operator
# Select work_year, salary_in_usd, and employee_residence by column position
places=[2,8,9]
getter=operator.itemgetter(*places)
cols=list(getter(AllCs))

dbsns=AllDS[cols]

fig,axes=plt.subplots();
#sns.swarmplot(data=dbsns,x=dbsns['work_year'],y=dbsns['salary_in_usd'],hue='employee_residence')
ax=sns.violinplot(data=bxs)
for violin in ax.collections:
    violin.set_alpha(0.25)
#axes.plot(mng,color='k',)
sns.boxplot(data=dbsns,x=dbsns['work_year'],y=dbsns['salary_in_usd']/12,hue='employee_residence')
#sns.swarmplot(data=boxes)
sns.despine(offset=10, trim=True)
plt.ylabel('Salary USD')
plt.xlabel('Year');
# Improve the legend
sns.move_legend(
    axes, loc="upper right", ncol=6, frameon=False, columnspacing=1, handletextpad=0,
)
#Boxes=pd.DataFrame.groupby(by=TempDS['work_year'])
#TempDS.boxplot(column=Boxes,by=TempDS['salary_in_usd'])

Historically, Data Scientists earned:

DSt=bxs.describe()
samples=DSt[DSt.columns].loc['count']
means=DSt[DSt.columns].loc['mean']
DSt
2020 2021 2022
count 21.000000 21.000000 21.000000
mean 7164.210317 5680.119048 10687.309524
std 6786.387393 3936.512508 3295.198118
min 1805.750000 238.250000 3271.916667
25% 3516.416667 3152.083333 8725.166667
50% 5227.166667 5122.250000 10250.000000
75% 8750.000000 7561.166667 12500.000000
max 34333.333333 13333.333333 17625.000000

But is the difference between years statistically significant?

from scipy.stats import f_oneway
f, p = f_oneway(bxs['2020'],bxs['2021'],bxs['2022'])
print(f"A comparison between the analyzed periods with ANOVA says it {'is significant' if p<0.05 else 'is not significant'}, with p= {round(p,4)}")
A comparison between the analyzed periods with ANOVA says it is significant, with p= 0.0052
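The ANOVA above only says that at least one year's mean differs; it does not say which pairs differ. A minimal post-hoc sketch using pairwise Welch t-tests with a Bonferroni correction (the arrays below are illustrative stand-ins for the `bxs` columns, not the real data):

```python
# Pairwise Welch t-tests with a Bonferroni-adjusted threshold, as a simple
# post-hoc follow-up to a significant one-way ANOVA.
from itertools import combinations
from scipy.stats import ttest_ind

samples = {
    "2020": [7100, 6900, 7300, 7000],   # hypothetical monthly salaries
    "2021": [5600, 5800, 5500, 5700],
    "2022": [10600, 10800, 10500, 10900],
}
pairs = list(combinations(samples, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction for 3 comparisons

for a, b in pairs:
    t, p = ttest_ind(samples[a], samples[b], equal_var=False)  # Welch's t-test
    print(f"{a} vs {b}: p={p:.4f} -> {'different' if p < alpha else 'inconclusive'}")
```

Running the same loop over the real `bxs['2020']`, `bxs['2021']`, and `bxs['2022']` columns would identify which years actually drive the ANOVA result.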

And how much do they earn today?

%sql alter table salary_prediction_updated_datas rename 'Job Role' to Job;
%sql alter table salary_prediction_updated_datas rename 'Education Level' to Education;
%sql update salary_prediction_updated_datas set salary = salary/12;
 * sqlite:///DS_sal.db
Done.
 * sqlite:///DS_sal.db
Done.
 * sqlite:///DS_sal.db
100000 rows affected.
[]

As of 2024, data scientists hold a variety of degrees:

careers=pd.read_sql_query("select Distinct education from salary_prediction_updated_datas",con)
levs=[]
for degree in careers['Education']:
    vs=pd.read_sql_query(f"select * from salary_prediction_updated_datas where Education like '%{degree[0:3]}%' AND Job = 'Data Scientist' limit 1",con)
    levs.append({"Years of Experience":vs.loc[0,"YearsExperience"],"Degree":vs.loc[0,"Education"],"Job Title":vs.loc[0,"Job"],"Average month salary":round(vs.loc[0,"Salary"],2)})
T=pd.DataFrame(levs)
print(T)




   Years of Experience            Degree       Job Title  Average month salary
0                12.17               PhD  Data Scientist               5517.32
1                10.85        Bachelor's  Data Scientist               5412.88
2                24.04          Master's  Data Scientist               9775.08
3                37.96       High School  Data Scientist              13527.06
4                31.41  Associate Degree  Data Scientist              11628.92
# We need to extract the data from the table.
# One way to do so is by querying the tables as shown below
phds=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like 'PhD' AND Job = 'Data Scientist'",con);
bach=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like '%Bachelor%' AND Job LIKE \'%Data%\'",con);
mast=pd.read_sql_query("select * from salary_prediction_updated_datas where Education like '%Master%' AND Job LIKE '%Data%'",con);
allDS=pd.read_sql_query("select * from salary_prediction_updated_datas where Job = 'Data Scientist'",con);
allDS2=cursor.execute("select * from salary_prediction_updated_datas where Job LIKE '%Data%'")
allDS2=pd.DataFrame(cursor.fetchall())
allDS2.rename(columns={0:"Index",1:"YearsExperience",2:"Education Level",3:"Job Role",4:"Salary"});
DSs=df[df['Job Role']=="Data Scientist"].copy()  # .copy() avoids SettingWithCopyWarning
DSs.loc[:,'Salary']=DSs.loc[:,'Salary']/12  # annual to monthly
Baches=DSs[DSs['Education Level']=="Bachelor's"]
Masters=DSs[DSs['Education Level']=="Master's"]
Docs=DSs[DSs['Education Level']=="PhD"]
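As an aside, a single pandas `groupby` can replace the per-degree queries above. A minimal sketch on a toy frame (the values are illustrative, not from the dataset):

```python
# Split rows by degree in one pass instead of one SQL query per degree.
import pandas as pd

toy = pd.DataFrame({
    "Education Level": ["PhD", "Bachelor's", "PhD", "Master's"],
    "Salary": [9000.0, 5400.0, 9500.0, 7200.0],
})
# One sub-frame per education level
by_degree = {deg: grp for deg, grp in toy.groupby("Education Level")}
print(by_degree["PhD"]["Salary"].mean())  # 9250.0
```

Applied to `DSs`, this would yield the same `Baches`/`Masters`/`Docs` splits without repeated round-trips to SQLite.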

Data scientists' income can be expressed as a linear function of experience

#import libraries
from matplotlib import pyplot as plt
import seaborn as sb
import numpy as np
import scipy
print('Libs uploaded')
Libs uploaded
# Plot the Salary vs the years of experience as a joint plot
dtpt=DSs
j=sb.jointplot(data=dtpt,x='YearsExperience',y='Salary',ratio=10,hue='Education Level',marker='+',marginal_ticks=True)
x=np.array(dtpt['YearsExperience'])
y=np.array(dtpt['Salary'])
# Keep only rows where both values are present, so the (x, y) pairs stay aligned
mask=~np.isnan(x)&~np.isnan(y)
X=x[mask]
Y=y[mask]
# Correlation coefficient for the data
r,p=scipy.stats.pearsonr(X,Y)

# Obtain the slope and intercept as measures of the yearly raise and initial salary from (x,y)

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

intercept,slope=estimate_coef(X,Y)
plt.suptitle('Data Scientists\' Salary');
print('Linear regression: y=mx+b; y={:3.2f}x+{:3.2f}, r={:.3}, p={:.4e}'.format(slope,intercept,r,p))
Linear regression: y=mx+b; y=299.87x+2503.50, r=0.99, p=0.0000e+00
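As a sanity check on the hand-rolled `estimate_coef`, `scipy.stats.linregress` computes the same least-squares fit. A small sketch on toy data (the array values are illustrative, not from the dataset):

```python
# Cross-check the closed-form slope/intercept against scipy's linregress.
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 300.0 * x + 2500.0          # perfectly linear toy salaries
res = linregress(x, y)

print(f"slope={res.slope:.2f}, intercept={res.intercept:.2f}, r={res.rvalue:.3f}")
# slope=300.00, intercept=2500.00, r=1.000
```

On the real `X`/`Y` arrays, `linregress` also returns the p-value and standard error directly, so it could replace both `estimate_coef` and the separate `pearsonr` call.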

And they come from different backgrounds:

numofClasses=pd.read_sql_query("select Education, count(*) as freq from salary_prediction_updated_datas where Job = 'Data Scientist' group by Education",con);
numofClasses
Education freq
0 Associate Degree 3989
1 Bachelor's 4074
2 High School 4083
3 Master's 4012
4 PhD 3969

But the earning trajectory is nearly identical across degrees.

TablaDeg=[]
for degree in numofClasses['Education']:
    base=DSs[DSs['Education Level']==degree]

    x=np.array(base['YearsExperience'])
    y=np.array(base['Salary'])
    # Keep only rows where both values are present, so the (x, y) pairs stay aligned
    mask=~np.isnan(x)&~np.isnan(y)
    X=x[mask]
    Y=y[mask]

    
    r,p=scipy.stats.pearsonr(X,Y)
    intercept,slope=estimate_coef(X,Y)
    TablaDeg.append({"Degree":degree,"n=":X.shape[0],"Raise":slope,"Initial":intercept,"r":r,"P-value":round(p,4)})
Tabla=pd.DataFrame(TablaDeg)        
  
print(Tabla)
fig, ax=plt.subplots()
ini=ax.plot(Tabla['Degree'],Tabla['Initial'], color='gray',label='Initial')#,
plt.ylabel('Initial salary (monthly)');
plt.xticks(rotation=15);
slp=plt.twinx()
rs=slp.plot(Tabla['Degree'],Tabla['Raise'],color='Blue',label='Raise')
fig.suptitle('Salary per education level');
plt.ylabel('Raise/year');
#plt.legend([ini,rs],['Initial','Raise'])
             Degree    n=       Raise      Initial         r  P-value
0  Associate Degree  3989  299.225118  2523.878898  0.989961      0.0
1        Bachelor's  4074  299.434688  2514.954747  0.989211      0.0
2       High School  4083  299.968767  2500.494866  0.989780      0.0
3          Master's  4012  300.334733  2488.717294  0.989846      0.0
4               PhD  3969  300.410188  2489.167711  0.990213      0.0

Take-home message

  • Between 2020 and 2022, data scientists' average pay rose overall, despite a dip in 2021.
    • How does this relate to demand across different sectors in today's technology-driven economy?
  • The education level at which data scientists start their career matters less than one might expect:
    • Everyone starts at roughly the same point.
      • Initial monthly salaries differ by only about $35 across degrees.
      • And salaries rise at virtually the same rate (~$300/month per additional year of experience).
# Export the notebook to HTML. Note: the shell escape does not raise a Python
# exception on failure, so a try/except would not catch errors here.
!jupyter nbconvert dsp001.ipynb --to html --template pj